fix(approx_fns): use exact percentile when no compression #21388
Open
aryan-212 wants to merge 3 commits into apache:main
How Databricks treats
| Function | Semantics | Behavior |
|---|---|---|
| `percentile` / `percentile_cont` | Continuous: interpolates between adjacent values | `median([1, 2]) = 1.5` |
| `percentile_approx` / `approx_percentile` | Discrete: returns an actual observed value from the dataset | `approx_median([1, 2]) = 1` |
This was verified by running the equivalent window query on Databricks against the same 21-row dataset used in DataFusion's window_using_aggregates test. The Databricks output confirmed that percentile_approx picks the nearest-rank value (no interpolation), while percentile interpolates.
nimalan-e6x approved these changes on Apr 7, 2026
Which issue does this PR close?
Rationale for this change
DataFusion's `approx_percentile_cont` / `approx_median` use a t-digest internally. The t-digest's interpolation step assumes centroids represent clusters of multiple points. But if the number of input rows is small (≤ the digest's `max_size` / compression threshold), no compression ever happens: every centroid has weight 1 and corresponds to exactly one input value.
In that regime, interpolation is not just unnecessary, it is actively wrong. The t-digest interpolates between adjacent centroids based on where the rank falls inside the centroid's weight, using half-deltas to neighbors. When every centroid has weight 1, this produces values that drift away from any actual data point.
This is particularly surprising for users running small queries or unit tests, who expect percentile functions on a handful of values to return one of those values.
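To make the drift concrete, here is a minimal sketch of half-delta interpolation over centroids that all have weight 1. It is a simplified model written for illustration, not DataFusion's actual `estimate_quantile`; the function name `interpolate_unit_centroids` and the exact formula are assumptions based on the description above.

```rust
/// Simplified model of t-digest style interpolation when every centroid has
/// weight 1. Hypothetical helper for illustration only, not DataFusion code.
/// `means` must be sorted ascending.
fn interpolate_unit_centroids(means: &[f64], q: f64) -> Option<f64> {
    let n = means.len();
    if n == 0 {
        return None;
    }
    if n == 1 {
        return Some(means[0]);
    }
    let rank = q.clamp(0.0, 1.0) * n as f64;
    // With unit weights, the centroid that contains `rank` is floor(rank).
    let pos = (rank.floor() as usize).min(n - 1);
    // Cumulative weight of all centroids before `pos`.
    let t = pos as f64;
    // Half the distance between the two neighbours (one-sided at the edges),
    // plus the bounds used to clamp the interpolated value.
    let (delta, lo, hi) = if pos == 0 {
        (means[1] - means[0], means[0], means[1])
    } else if pos == n - 1 {
        (means[pos] - means[pos - 1], means[pos - 1], means[pos])
    } else {
        (
            (means[pos + 1] - means[pos - 1]) / 2.0,
            means[pos - 1],
            means[pos + 1],
        )
    };
    // Offset of the rank inside the centroid, centred on the centroid mean.
    let value = means[pos] + ((rank - t) - 0.5) * delta;
    Some(value.clamp(lo, hi))
}
```

Under this model, calling `interpolate_unit_centroids` with `q = 0.5` on the ten-value window frame `[-85, -72, -56, -48, -43, -25, -12, -5, 45, 83]` cited in this description gives `-32.75`, which truncates to `-32` as an integer and is not one of the input values.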
Concrete Example
Let's take a small example from the TPC-DS schema.
Now if we take a small `APPROX_PERCENTILE` query like:
From here, `0.85 * 14` yields 11.9, which the nearest-rank method rounds up to 12, so the output for the above `APPROX_PERCENTILE` query should be `84336`, and that is what we get when we run the same query in Databricks.
But in DataFusion this comes up as:
This PR aims to fix this.
What was wrong before
Prior to this change, when no t-digest compression occurred, `estimate_quantile` still ran the t-digest interpolation path. This produced values that were neither the true continuous percentile (`percentile_cont`) nor the discrete nearest-rank value (`percentile_approx` / Databricks).
For example, `approx_median` on the 10-value window frame `[-85, -72, -56, -48, -43, -25, -12, -5, 45, 83]` returned `-32`: not `-34` (the true continuous median) and not `-43` (the discrete nearest-rank median).
What changes are included in this PR?
- `tdigest.rs`: When no compression has occurred (`self.count == self.centroids.len()`), bypass the t-digest interpolation and use `exact_quantile` instead. This method uses the nearest-rank (ceiling) method: `index = ceil(q * n) - 1`, which returns an actual observed data value, matching Databricks' `percentile_approx` / `approx_percentile` semantics.
- Test expectation updates: Updated snapshot and SQL logic test expectations across:
  - `datafusion/core/tests/dataframe/mod.rs`: `window_using_aggregates` snapshot
  - `datafusion/sqllogictest/test_files/aggregate.slt`: `approx_median`, `approx_percentile_cont`, and `approx_percentile_cont_with_weight` test expectations
  - `datafusion/sqllogictest/test_files/aggregate_skip_partial.slt`: `approx_median` with grouping, nulls, and filters
  - `datafusion/sqllogictest/test_files/metadata.slt`: `approx_median(distinct id)` on small table
Are these changes tested?
Yes. All existing tests have been updated to reflect the new behavior. The key tests are:
- `window_using_aggregates`: window function with `approx_median` over varying frame sizes
- `aggregate.slt`: `approx_percentile_cont` at various percentiles (0.5, 0.95), including Float16/Float64/decimal types, with and without weights
- `aggregate_skip_partial.slt`: `approx_median` with `GROUP BY`, nullable columns, and `FILTER` clauses
- `metadata.slt`: `approx_median(distinct id)` regression test
Are there any user-facing changes?
Yes. `approx_percentile_cont`, `approx_median`, and `approx_percentile_cont_with_weight` will now return exact nearest-rank values (matching Databricks behavior) when the input dataset is small enough that no t-digest compression occurs (fewer than ~100 values per group by default). For larger datasets where compression happens, the existing t-digest approximation behavior is unchanged.
This means `approx_median` and `percentile_cont(0.5)` may now return different values for small datasets. This is expected and consistent with how Databricks distinguishes approximate vs exact percentile semantics.
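The difference between the two semantics can be sketched in a few lines of Rust. This is an illustration only, assuming already-sorted `f64` input: `nearest_rank_percentile` applies the `index = ceil(q * n) - 1` rule stated above, and `continuous_percentile` uses ordinary linear interpolation between adjacent values. Both function names are hypothetical and neither is the code from `tdigest.rs`.

```rust
/// Discrete percentile via the nearest-rank (ceiling) rule:
/// index = ceil(q * n) - 1. Always returns one of the input values.
/// Hypothetical helper; `sorted` must be sorted ascending.
fn nearest_rank_percentile(sorted: &[f64], q: f64) -> Option<f64> {
    if sorted.is_empty() {
        return None;
    }
    let n = sorted.len();
    // 1-based rank; q = 0 yields rank 0, which is mapped to the minimum.
    let rank = (q.clamp(0.0, 1.0) * n as f64).ceil() as usize;
    let index = rank.saturating_sub(1).min(n - 1);
    Some(sorted[index])
}

/// Continuous percentile via linear interpolation between the two values
/// that surround position q * (n - 1). May return a value not in the input.
/// Hypothetical helper; `sorted` must be sorted ascending.
fn continuous_percentile(sorted: &[f64], q: f64) -> Option<f64> {
    if sorted.is_empty() {
        return None;
    }
    let pos = q.clamp(0.0, 1.0) * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    let frac = pos - lo as f64;
    Some(sorted[lo] + (sorted[hi] - sorted[lo]) * frac)
}
```

On the ten-value window frame from this description, the continuous version returns `-34` and the nearest-rank version returns `-43` for `q = 0.5`. For 14 values and `q = 0.85`, the nearest-rank rule selects the 12th smallest value.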